fix:table detection #174

Open
wants to merge 4 commits into
base: master
Conversation

@hdoer hdoer commented Aug 23, 2023

Two functions, "_determine_number_of_rows_and_columns" and "_determine_table_cell_boundaries", cast the line endpoints to int in order to obtain the unique coordinate values. This conversion causes the "_is_unbroken" function to fail.

I read this part of the code. Usually, whether two lines intersect is judged by the distance between their x / y values.

But the function "_determine_table_cell_boundaries" violates this convention in order to get the unique values, and the function "_is_unbroken" then uses the converted int values.

This sometimes throws the error "A Rectangle must have a non-negative height".

Example:
original:
xs: [110.81, 484.49, 110.81, 484.49, 110.81, 484.49, 111.05, 111.05, 207.35, 207.35, 262.85, 262.85, 315.65, 315.65, 430.25, 430.25, 484.25, 484.25]
ys: [526.19, 526.19, 557.89, 557.89, 647.49, 647.49, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25]

sorted unique (after casting to int):
xs: [110, 111, 207, 262, 315, 430, 484]
ys: [525, 526, 557, 647]
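
The numbers above can be reproduced with a few lines of Python. This is only a sketch: `unique_int_coords` and `cluster_coords` are hypothetical helpers written for illustration, not functions from the library and not necessarily what the commits in this PR do. The second helper groups coordinates by distance (with an assumed tolerance of 1.0), which is closer to the distance-based convention described above.

```python
xs = [110.81, 484.49, 110.81, 484.49, 110.81, 484.49, 111.05, 111.05, 207.35,
      207.35, 262.85, 262.85, 315.65, 315.65, 430.25, 430.25, 484.25, 484.25]
ys = [526.19, 526.19, 557.89, 557.89, 647.49, 647.49, 525.95, 647.25, 525.95,
      647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25, 525.95, 647.25]


def unique_int_coords(values):
    """Truncate each coordinate to int and return the sorted unique values."""
    return sorted({int(v) for v in values})


def cluster_coords(values, tolerance=1.0):
    """Group sorted coordinates whose gap to the previous value is at most
    `tolerance` (an assumed threshold) and return the minimum of each group."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tolerance:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [min(c) for c in clusters]


print(unique_int_coords(xs))  # [110, 111, 207, 262, 315, 430, 484]
print(unique_int_coords(ys))  # [525, 526, 557, 647]
print(cluster_coords(xs))     # [110.81, 207.35, 262.85, 315.65, 430.25, 484.25]
print(cluster_coords(ys))     # [525.95, 557.89, 647.25]
```

Truncation turns 110.81 and 111.05 into two separate x boundaries (110 and 111), and 525.95 and 526.19 into two separate y boundaries, even though each pair is only 0.24 apart. Grouping by distance keeps each pair together.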

In addition, in the check "min(l.y0, l.y1) <= r.get_y() and max(l.y0, l.y1) >= r.get_y() + r.get_height()" in the function "_is_unbroken", r.get_y() uses the converted value while l.y0 / l.y1 use the original values. The converted value is always <= the original value, so the check returns false whenever l and r share an endpoint whose coordinate has a fractional part.
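
A minimal sketch of this mismatch, using one of the vertical lines from the example. The `Line` and `Rect` classes are simplified stand-ins written for illustration (plain attributes instead of the library's `get_y()` / `get_height()` accessors), and `spans_vertically` only mirrors the quoted condition.

```python
from dataclasses import dataclass


@dataclass
class Line:
    y0: float
    y1: float


@dataclass
class Rect:
    y: float
    height: float


def spans_vertically(l: Line, r: Rect) -> bool:
    """Mirror of the quoted check: does the line cover the rectangle's full height?"""
    return min(l.y0, l.y1) <= r.y and max(l.y0, l.y1) >= r.y + r.height


# Vertical line from the example, running from y=525.95 to y=647.25.
line = Line(y0=525.95, y1=647.25)

# Cell between the first two row boundaries, built from int-cast coordinates.
cell_int = Rect(y=int(525.95), height=int(557.89) - int(525.95))

# The same cell built from the original float coordinates.
cell_float = Rect(y=525.95, height=557.89 - 525.95)

print(spans_vertically(line, cell_int))    # False: 525.95 <= 525 does not hold
print(spans_vertically(line, cell_float))  # True
```

The line does cover the cell, but the truncated lower edge (525) lies below the line's real endpoint (525.95), so the first half of the condition fails.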

This PR fixes the problem described above.

@jorisschellekens
Owner

Can you provide me with a PDF of where the previous code fails?

@hdoer
Author

hdoer commented Aug 23, 2023

Sorry, I can't provide the original PDF document because it contains private information.

@hdoer
Author

hdoer commented Aug 23, 2023

test.pdf
I drew a table in test.pdf using reportlab. The table has 9 line segments, and the problem described above can be reproduced with this file.
The coordinates of these line segments are the same as those listed above.
Note that reportlab uses the lower-left corner as the origin.

@Anwen954

I encountered the same problem. Here is my test PDF.
icbc.pdf
